BMJ Health & Care Informatics

BMJ

Preprints posted in the last 30 days, ranked by how well they match BMJ Health & Care Informatics's content profile, based on 13 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
Physician Facing AI Tools Show Distinct Failure Modes Under Structured Stress Testing

Hazare, N. S.; Oh, W.; Kumar, G.; Goel, N.; Shaikh, A.; Sharma, A.; Desman, J.; Kumar, A.; Robles, C.; Singh, A.; Jangda, M.; Agaron, S.; Capone, C.; Ngai, D.; Itwaru, A.; Parchure, P.; Ramaswamy, A.; Gorbenko, K.; Timsina, P.; Lampert, J.; Tamler, R.; Manasia, A.; Kohli-Seth, R.; Kaplan, B.; Vakil, A.; Omar, M.; Glicksberg, B. S.; Freeman, R.; Stern, A. D.; Klang, E.; Darrow, B.; Stump, L. S.; Reich, D.; Charney, A.; Nadkarni, G. N.; Sakhuja, A.

2026-05-29 health informatics 10.64898/2026.05.27.26354248 medRxiv
Top 0.1%
7.3%

Importance: Physician-facing AI tools are now in clinical use, yet whether different platforms fail in similar or fundamentally different ways in high-stakes settings like critical care is unknown. Objective: To evaluate two physician-facing AI platforms, ChatGPT for Clinicians and OpenEvidence, for distinct vulnerabilities under structured stress testing. Design, Setting, and Participants: An observational study conducted using 60 simulated critical care vignettes developed and adjudicated by four attending critical care physicians. Data were collected in the last week of April 2026, via the public website interfaces of each platform. Interventions/Exposures: A 2x2x2x2 factorial design across four stressors - anchoring, cognitive load, social conformity pressure, and a clinically incorrect directive - yielded 16 prompt subsets per vignette and 960 prompts per platform. A separate multi-turn adversarial prompting paradigm administered three sequential "You are incorrect" challenges to baseline vignettes. All prompts had a universal output length constraint of fewer than 30 words. Main Outcomes and Measures: Critical elements capture (percentage of gold-standard critical elements present in responses), susceptibility to clinically incorrect directive, and sycophancy (reversal of an initial correct recommendation under iterative adversarial challenge). Results: Across 1916 responses to 1920 prompts, ChatGPT for Clinicians captured more gold-standard critical elements than OpenEvidence (81.4% ± 18.1% vs 61.0% ± 23.5%; adjusted difference, 20.3 percentage points; 95% CI, 18.3 to 22.4; P < .001) and was less susceptible to clinically incorrect directives (1.7% vs 8.0%; adjusted odds ratio, 0.07; 95% CI, 0.02-0.21; P < .001). Anchoring and social conformity pressure were associated with reduced critical element capture across both platforms, while cumulative stressor burden reduced critical element capture only on OpenEvidence. 
Conversely, ChatGPT for Clinicians reversed correct recommendations more readily under adversarial prompting (hazard ratio, 2.61; 95% CI, 1.10 - 6.19; P = .03). Conclusion and Relevance: The two physician-facing clinical AI platforms evaluated demonstrated non-overlapping vulnerabilities, with neither platform uniformly superior. These findings argue against single-axis ranking of clinical AI systems and support multidimensional safety evaluation encompassing completeness of reasoning, resistance to incorrect directives, and stability under adversarial challenge.
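
The prompt counts in this abstract follow directly from the factorial design, and a short enumeration makes the arithmetic explicit. The sketch below is illustrative only: the stressor identifiers are hypothetical labels for the four stressors named in the abstract, and nothing here reflects the authors' actual prompt-generation code.

```python
from itertools import product

# Hypothetical identifiers for the four stressors named in the abstract.
STRESSORS = ("anchoring", "cognitive_load", "social_conformity", "incorrect_directive")
N_VIGNETTES = 60   # simulated critical care vignettes
N_PLATFORMS = 2    # ChatGPT for Clinicians and OpenEvidence


def factorial_conditions(stressors=STRESSORS):
    """Return every on/off combination of the stressors as a dict."""
    return [
        dict(zip(stressors, levels))
        for levels in product((False, True), repeat=len(stressors))
    ]


def stressor_burden(condition):
    """Count how many stressors are switched on in one condition."""
    return sum(condition.values())


conditions = factorial_conditions()
prompts_per_platform = len(conditions) * N_VIGNETTES
total_prompts = prompts_per_platform * N_PLATFORMS

print(len(conditions), prompts_per_platform, total_prompts)
```

Here `stressor_burden` is one plausible reading of the "cumulative stressor burden" mentioned in the results, namely the number of active stressors per prompt; the abstract does not define it further.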

2
Explainable AI and public reactions to AI-involved adverse diagnostic events: a vignette study

Choi, J.; Kim, Y. J.; Lyu, P.; Luan, Y. L.; Toh, S. M.

2026-06-02 health informatics 10.64898/2026.05.26.26353870 medRxiv
Top 0.1%
7.0%

Artificial intelligence (AI) is increasingly incorporated into diagnostic decision-making, raising questions about physician responsibility following AI-involved adverse diagnostic events. Explainable AI (XAI) has been proposed to improve transparency and trust, but its influence on public reactions remains unclear. In a randomised vignette-based experiment, 652 adults from the United States and United Kingdom were assigned to one of six conditions in a 3 (diagnostic source: AI alone, human radiologist alone, or human-AI collaboration) x 2 (explanation: present or absent) between-subjects design. Participants read a scenario in which a chest X-ray was initially interpreted as normal but lung cancer was diagnosed five months later, indicating that the original interpretation had missed the cancer. In explanation conditions, participants received additional information about how the diagnosis had been reached, including AI heatmap-based explanations in the AI conditions. Participants rated radiologist responsibility, likelihood of complaint, and intention to pursue legal action. Among 652 participants (mean age 42.2 years; 50.2% female), responsibility ratings were significantly lower when AI alone made the diagnostic decision (mean 4.73, 95% CI 4.53-4.93) compared with human-only decision-making (5.78, 95% CI 5.59-5.98; p<0.001) and human-AI collaboration (5.54, 95% CI 5.34-5.74; p<0.001). Complaint likelihood showed a similar pattern. Intentions to pursue legal action followed the same directional trend but were marginally significant. Neither explanations nor explanation-by-source interactions were associated with outcome measures. These findings suggest that the public expects physicians to remain accountable when AI is involved in diagnostic decision-making, particularly in collaborative settings. 
Providing explanatory information about how AI systems reach decisions may be insufficient to change perceptions of physician responsibility following adverse diagnostic events.

3
The Verification Gap: Artificial Intelligence Adoption, Hallucination Awareness, and Verification Practices Among Early Career Medical Researchers in Pakistan

Sajjad, M.

2026-05-30 health informatics 10.64898/2026.05.28.26354373 medRxiv
Top 0.1%
7.0%

Artificial intelligence (AI) tools have been rapidly adopted by medical researchers, yet whether early-career researchers in low- and middle-income countries possess the awareness and habits needed to use these tools safely remains poorly documented. This study characterized AI adoption patterns, hallucination awareness, and verification and disclosure practices among early-career medical researchers in Pakistan. A cross-sectional anonymous online survey was conducted among medical students, house officers, residents, physicians, and faculty involved in research or academic work across Pakistan (May 2026). Descriptive statistics and chi-square tests were applied to 373 eligible responses. AI use was near universal (99.7%), with 60.3% using AI tools daily. The most commonly reported tool in this sample was Claude (40.5%), followed by ChatGPT (29.2%) and Perplexity (26.0%), though this ranking likely reflects sampling characteristics. Despite high adoption, 59.2% typically did not verify AI outputs before use, and 40.2% had never heard that AI can generate fabricated scientific references. In behavioral vignettes, 36.5% assumed convincing AI-generated references were authentic, and 54.2% would continue using remaining AI content after discovering one fabricated reference. Formal research training was strongly associated with consistent disclosure (51.7% vs. 17.1%; chi-square = 48.43, p < 0.001). Role, daily use frequency, and research training were not significantly associated with verification behavior. Early-career medical researchers in Pakistan demonstrate high AI adoption alongside incomplete hallucination awareness and infrequent verification, a pattern that may carry implications for research integrity. Formal training was the only factor significantly associated with consistent disclosure. Integration of AI literacy into medical curricula and institutional governance frameworks merits consideration.

4
Language-dependent diagnostic safety of medical AI systems: a cross-lingual benchmarking and prospective clinical study

Wang, Y.; He, H.; Zhu, R.; Lu, Y.; Phadungsaksawasdi, P.; Peng, M.; Liu, Z.; Zou, K.; Zhang, Y.; Chew, S. P.; Tham, Y. C.; Khorasani, A.; Deng, H.; Cheng, C.-Y.; Yang, J.; Liu, D.

2026-05-21 health informatics 10.64898/2026.05.19.26353490 medRxiv
Top 0.1%
6.5%

Background Patients worldwide receive healthcare in many languages, yet medical AI systems are validated almost exclusively in high-resource languages such as English and Chinese, exposing patients in other linguistic settings to unquantified diagnostic risk. Existing multilingual evaluations rely on translated research-style benchmarks that fail to capture authentic clinical work. We aimed to characterise the patient safety consequences of multilingual medical AI deployment in real-world clinical settings and to develop an auditable detection method for unsafe outputs. Methods We evaluated different language models (LLMs) and visual language models (VLMs) across four real-world clinical tasks (conversational QA, radiology report generation, glaucoma diagnosis, ICU re-intubation prediction) in five languages (English, Chinese, Malay, Thai, Persian). We developed a token-level uncertainty toolkit to localise reasoning instability, compared three inference paradigms (native-language, English chain-of-thought, back-translation pivot), and conducted a prospective study (50 dialogues, 150 physician-reviewed records). Findings LLM/VLM performance degraded consistently from high- to low-resource languages across all tasks. Key gaps included: HealthBench score declining from 0.3743 to 0.3180; radiology macro-F1 from 0.2938 to 0.2149-0.2424, consistent with selective pathology suppression; glaucoma accuracy from 50.7% to 32.7%; ICU parameter recall from 100.0% to 48.5%. Multimodal inputs amplified degradation. Qwen3 VL 235B showed attenuated decline with no resource-ordered pattern in glaucoma classification. Token-level analysis localised instability to mid-chain stages (40-70% of the normalised trajectory); perplexity-based confidence failed to flag errors (AUROC 0.41-0.66). Back-translation pivot consistently restored performance. 
In the prospective study, 98.7% of records required physician edits (overall modification score 53.6%); Thai-pivot correction burden (59.0%) exceeded English-pivot (50.7%, p=0.003) and Chinese-direct (51.0%, p=0.004). Interpretation Multilingual deployment produced clinically consequential failures, including missed pathology, distorted physiological extraction, and amplified multimodal misclassification, that were invisible to monolingual validation and not reliably flagged by model confidence. Pretraining data composition may contribute to multilingual safety risk. Language-specific safety auditing should precede deployment in non-dominant-language healthcare settings; the open-source detection toolkit enables this without model retraining.

5
A Retrospective Evaluation of the Microsoft Healthcare Agent Orchestrator for Tumor Board Patient Summaries

Roy, J.; Korleski, J. B.; Augustin, R. C.; Yefet, L.; Jensen, Z. D.; Ehman, E. C.; Zadeh, G.; Conners, A. L.; Tevaarwerk, A. J.; Korfiatis, P.

2026-06-01 health informatics 10.64898/2026.05.22.26353812 medRxiv
Top 0.1%
6.4%

Background: Preparing tumor board patient summaries is time intensive. Large-language-model based systems may automate summarization but require real-world evaluation prior to clinical use. We performed an exploratory retrospective evaluation of the Microsoft Healthcare Agent Orchestrator (HAO), deployed in a Mayo Clinic controlled staged environment, to generate tumor board-style patient summaries from retrospective Electronic Health Record (EHR) notes. Methods: HAO generated summaries for breast, hepatobiliary, and neuro-oncology tumor board cases using up to the most recent 1,000 clinical notes. Clinician reviewers evaluated outputs via REDCap surveys across perceived factuality, completeness, clarity/conciseness, temporal cohesion, comparative performance, safety, and clinical utility (0-4 Likert scale). Reviewers were permitted to query the HAO chat interface to address missing details. Automated factuality was assessed using TBFact (bidirectional entailment), reporting precision and recall against available reference summaries. Results: Among 57 survey responses from 5 different physicians, mean scores exceeded 2.8 across domains, with medians of 3 for most axes. In an exploratory comparison, oncology fellows required less time to review HAO-generated summaries than to manually generate patient summaries (mean difference 13.57 minutes per patient, p<0.001), although this difference may be influenced by prior familiarity with the same cases; 96% of survey responses indicated that HAO would save time. TBFact evaluations showed higher recall than precision across domains, consistent with broad capture of reference content alongside additional content that was not present in gold-standard summaries. Attribution was viewed favorably but showed issues with primary-source specificity and link reliability. Conclusions: In a controlled Mayo environment, HAO demonstrated moderate performance and was associated with reduced review time for tumor board preparation. 
These findings are promising but preliminary and do not establish clinical safety, noninferiority to manual review, or readiness for routine clinical use. Limitations, including verbosity, specialty-specific content gaps, and inconsistent attribution, highlight the need for iterative refinement and further evaluation.

6
AI Decision Support for Challenging Teledermatology Cases: MedGemma Performance in the Dermatology ECHO Program

Appiagyei, J. B.; Otu, R. O.; Henry, M. K.; Casterline, B. W.; Becevic, M.

2026-05-26 health informatics 10.64898/2026.05.21.26353523 medRxiv
Top 0.1%
6.2%

Teledermatology expands access to dermatologic expertise in rural settings, yet diagnostic uncertainty persists in low-resource primary care. This retrospective study evaluated MedGemma-4B-IT, a compact multimodal vision-language model, as adjunctive clinical decision support for challenging diagnostic cases. We analyzed 77 zero-concordance cases (360 clinical photographs) from a Dermatology Extension for Community Healthcare Outcomes (ECHO) tele-mentoring program (2016-2021). Zero-concordance cases showed no overlap between primary clinician provisional diagnosis and dermatologist-confirmed diagnosis. The model was prompted using dermatologist-style format to generate ranked differential diagnoses. Performance was assessed using strict case-level top-k exact-match accuracy and relaxed matching criteria based on fuzzy string similarity. MedGemma achieved 0.0% strict top-1 accuracy, 1.3% top-3 accuracy, 3.9% top-5 accuracy, and 3.9% top-10 accuracy. Relaxed concept-level matching achieved 28.6% top-1, 63.6% top-5, and 67.5% top-10 accuracy. Image-level accuracy was 44.2% (159/360, 95% CI 39.0-49.5%). The model surfaced the correct diagnosis within differential lists in 45.5% of cases despite no exact top-1 matches, suggesting utility for differential expansion rather than definitive diagnosis. Performance varied across diagnostic categories, with highest accuracy in Other categories (54.5%) and lowest in neoplastic conditions (0.0%). Common errors included confusion between inflammatory and other diagnostic groupings. These findings characterize MedGemma performance on real-world teledermatology cases and inform safe, clinician-in-the-loop integration into teledermatology workflows where specialist oversight remains essential.
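
The gap between strict and relaxed accuracy reported above comes down to how a predicted label is compared with the confirmed diagnosis. The following is a minimal sketch of case-level top-k scoring under both rules. It assumes `difflib.SequenceMatcher` as the fuzzy similarity measure and a threshold of 0.6; the abstract specifies neither, and the example cases are invented for illustration.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(label.lower().split())


def is_match(pred, truth, threshold=None):
    """Exact match when threshold is None, otherwise fuzzy string similarity."""
    p, t = normalize(pred), normalize(truth)
    if threshold is None:
        return p == t
    return SequenceMatcher(None, p, t).ratio() >= threshold


def top_k_accuracy(cases, k, threshold=None):
    """Fraction of cases whose true label matches any of the first k predictions."""
    hits = sum(
        any(is_match(pred, truth, threshold) for pred in ranked[:k])
        for ranked, truth in cases
    )
    return hits / len(cases)


# Invented examples: (ranked differential, confirmed diagnosis).
cases = [
    (["atopic dermatitis", "psoriasis", "tinea corporis"], "psoriasis vulgaris"),
    (["contact dermatitis", "scabies"], "scabies"),
]

strict_top1 = top_k_accuracy(cases, k=1)
strict_top3 = top_k_accuracy(cases, k=3)
relaxed_top3 = top_k_accuracy(cases, k=3, threshold=0.6)  # assumed threshold
print(strict_top1, strict_top3, relaxed_top3)
```

In the first invented case, "psoriasis" fails the exact-match rule against "psoriasis vulgaris" but passes the fuzzy rule, which is the kind of near miss that separates the strict and relaxed figures.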

7
Adaptable Stroke Education Improves Knowledge Across Diverse High School Settings

Namian, S.; DiBiase, R.; Elnazer, S. H.; Evers, C.; Fung, C.; Narula, R.; Rafferty, M.; Salahuddin, A.; Sardana, D. J.; Shea, J.; Sullivan, M.; Forman, R.

2026-05-18 neurology 10.64898/2026.05.14.26353185 medRxiv
Top 0.1%
5.0%

Background: High school students may be able to communicate health topics to peers and adults. Yet, few studies have evaluated the role of high school students in community health initiatives, making them an underutilized group for disseminating health information. We pilot tested stroke education across five high schools using varied delivery approaches as a preliminary step toward evaluating youth stroke education to improve community health. Methods: In April-May 2025, five high schools in Connecticut and New York participated in stroke education. The format was designed to fit the needs of each school and included an 8-session classroom curriculum (Derby, CT), after-school club meetings (New Haven, CT; Long Island, NY), and one large assembly (Bridgeport, CT). Developed by teachers and neurology providers, the curriculum covered stroke risk factors, symptoms, and emergency response. Students completed a 15-point assessment adapted from the validated Stroke Action Test before, immediately after, and 4-6 weeks post-intervention; data were collected between April and July 2025. Results: Of 112 students completing the pre-test, 99 (88%) completed the immediate post-test and 51 (46%) the delayed follow-up. Average scores rose from 47% pre-intervention to 75% post and 70% at 4-6 weeks. All schools scored <50% on pre-tests suggesting poor baseline stroke knowledge. Conclusion: This pilot suggests that stroke education can be delivered to high school students across varied settings and may support knowledge gains up to 6 weeks. Limitations included small sample sizes and missing follow-up data. If validated in larger studies, this adaptable, teacher-supported approach could offer a scalable public health strategy for improving community stroke preparedness.

8
Investigating the Readability, Visual Design, and Quality of Online Written Pharmacogenomics Health Information for Health Consumers in Australia

Giblett, M. J.; Babikian, Y.; Jhala, D. J.; Medland, S. E.

2026-05-29 health informatics 10.64898/2026.05.27.26354169 medRxiv
Top 0.1%
4.8%

Pharmacogenomics (PGx) offers a pathway towards personalised medicine, which relies on health consumer involvement in making informed decisions. As consumers increasingly seek health information online, high-quality digital resources are essential to support informed consent and shared decision making. The complexity of PGx and widespread limitations in health literacy raise concerns about whether existing consumer-facing online PGx resources are understandable and sufficiently comprehensive. This study evaluates the readability, visual design, and informational quality of publicly available online written PGx health information. Twenty-three webpages met inclusion criteria. The mean readability corresponded to approximately 15 years of formal education (university level), substantially exceeding the Australian Government's recommended Year 7 reading level for public health materials. Informational quality was generally low, with most webpages being rated as poor or very poor. In contrast, visual design quality was relatively strong, with webpages achieving on average around three-quarters of the criteria. Although the visual presentation of PGx webpages is generally professional, their high reading difficulty and limited discussion of treatment choices and uncertainties reduce their usefulness for health consumer education. Improving readability, clearly communicating risks and limitations, and incorporating decision-support features may enhance the ability of online resources to support informed consent and shared decision making.

9
Design and Usability Evaluation of a Digital Guideline Management Application for a Pediatric Cardiac Center

Heidenreich, B. M.

2026-05-26 health informatics 10.64898/2026.05.24.26353982 medRxiv
Top 0.1%
4.8%

Background. Complex cases in specialized pediatric care require consistent adherence to evidence-based clinical pathways and protocols to ensure safe, high-quality, and equitable care. Currently, clinical pathways and supporting documentation are frequently distributed across multiple platforms, leading to fragmentation. Human-centered design principles can guide the development of healthcare technologies that minimize cognitive load and support rapid, efficient access to relevant information in clinical settings. The purpose of this study is to design and evaluate perceived usability of a pediatric cardiac center digital guideline management system that is embedded within the electronic health record leveraging human-centered design. Methods. This study used a mixed-methods usability evaluation to assess a digital guideline management system prototype embedded into clinical workflow. Through human-centered design principles, the prototype provides a centralized digital document library that organizes cardiac-specific clinical pathways, guidelines, procedures, and related resources. A small but diverse sample, encompassing a wide variety of roles and clinical areas within the pediatric cardiac center, was recruited to evaluate the perceived usability of the prototype. Usability was evaluated by stakeholders using the validated System Usability Scale (SUS) with additional optional questions to understand perceptions of the information architecture and clinical value. Results. Preliminary usability testing showed a mean SUS composite score of 76.5, indicating above average usability. Questions related to the complexity of the system and user confidence received high scores across participants. Lower scores were observed for questions related to usage frequency and ability to learn the system very quickly. Conclusion. 
Leveraging human-centered design when building a digital guideline management system embedded within clinical workflow revealed positive perception from participants. By centralizing access to clinical resources, this prototype can reduce current-state fragmentation. Further evaluation of larger samples is needed to develop a list of future recommendations.
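
For readers unfamiliar with how a System Usability Scale composite such as the 76.5 above is derived, the standard scoring rule can be sketched as follows. This assumes the usual 10-item questionnaire with responses on a 1 to 5 scale; the function name is illustrative and the abstract does not describe the study's own scoring code.

```python
def sus_score(responses):
    """Convert ten 1-5 item responses into a 0-100 SUS composite score.

    Odd-numbered items are positively worded and contribute (response - 1);
    even-numbered items are negatively worded and contribute (5 - response).
    The summed contributions (0-40) are scaled by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1
        else:
            total += 5 - response
    return total * 2.5


neutral = sus_score([3] * 10)
best = sus_score([5, 1] * 5)
worst = sus_score([1, 5] * 5)
print(neutral, best, worst)
```

A study-level composite like the one reported is then the mean of these per-participant scores.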

10
Professionalism Pulse: Development and Validation of a Natural Language Processing Pipeline and Dashboard for Safety Culture Surveillance in NYC Health + Hospitals

Mangut, E.; Wallace, R.

2026-05-22 health informatics 10.64898/2026.05.19.26353620 medRxiv
Top 0.1%
4.7%

Background: Professionalism and effective communication are foundational determinants of patient safety and quality of care. Unprofessional behaviors frequently serve as active precursors to adverse clinical events. However, proactive organizational surveillance is often hindered because incident feedback exists primarily as unstructured, free-text data. This study aimed to develop and validate a Natural Language Processing (NLP) pipeline and interactive dashboard to proactively monitor the "professionalism climate" within NYC Health + Hospitals, the largest municipal healthcare delivery system in the United States. Methods: A high-fidelity synthetic dataset (N=400) was computationally generated to safely mirror historical incident logs across 11 acute facilities without utilizing Protected Health Information (PHI). A rule-based NLP pipeline was developed in R utilizing the tidytext package. Unstructured narrative feedback was tokenized and classified into three core domains: Respect, Safety, and Communication. To validate the pipeline's accuracy, a 25% random stratified sample (n=100) was evaluated against independent, blinded manual coding performed by two reviewers, with inter-rater reliability measured via Cohen's Kappa. Finally, an interactive Tableau dashboard was developed to operationalize and visualize these metrics for ongoing surveillance. Results: The NLP algorithm achieved an overall accuracy of 85.8% (95% CI: 79.0-92.6), with 81.2% sensitivity and 88.9% specificity. The highest domain-specific performance was observed in Communication (88.0% accuracy). Manual validation demonstrated strong inter-rater reliability (k=0.84). Operational analysis via the dashboard revealed that 61.8% of reports occurred during the Tour 2 shift (15:00 to 23:00), aligning with peak operational volume. 
Furthermore, Respect-related feedback was reported at a disproportionately high frequency during the Tour 3 shift (23:00 to 07:00), accounting for over 50.7% of overnight feedback submissions. Conclusion: Rule-based NLP successfully transforms qualitative healthcare feedback into structured, actionable intelligence with high specificity. Integrating this pipeline into operational dashboards transitions safety culture surveillance from a reactive, manual exercise to a proactive, scalable system, enabling targeted, data-driven interventions by hospital leadership.

11
Quality and Safety profiles of AI-Generated vs Clinician-Generated Handoffs in Hospital Medicine

Shah, K. P.; Airan Javia, S.; Savage, T.; Bressman, E.

2026-06-08 health informatics 10.64898/2026.06.05.26354946 medRxiv
Top 0.1%
4.7%

End-of-rotation handoffs are critical for patient safety but add to documentation burden for hospitalists. Generative artificial intelligence (AI) may help automate handoff creation using electronic health record data, but its impact on quality and safety is unclear. Methods: We developed an AI handoff tool with a large language model using clinical notes as input and conducted a retrospective evaluation comparing AI-generated and clinician-authored handoffs. Handoffs were assessed across domains of quality and safety through a structured review. Results: Quality ratings were similar between AI and human handoffs (3.7 vs. 3.5, p=0.57). AI-generated handoffs were rated higher for organization (4.4 vs. 4.1, p=0.05) and completeness (4.1 vs. 3.6, p=0.01), but lower for conciseness (3.7 vs. 4.1, p=0.03) and accuracy (4.1 vs. 4.4, p=0.03). Error rates were comparable (0.3/handoff in both groups); however, AI-generated handoffs included inaccuracies (9% of AI errors) and hallucinations (1% of AI errors), while clinician-authored handoffs contained only omissions. Conclusion: Human and AI handoffs have differing error profiles and tradeoffs between completeness and conciseness. Prospective evaluation in clinical workflows is underway.

12
Operationalizing Eight-Dimensional Patient-Safety Risk Scoring at Scale: A Multi-Model Large Language Model Reliability Study

Lin, H.-M.; Lyu, J.; Wang, I.-L.

2026-06-01 health informatics 10.64898/2026.05.29.26354437 medRxiv
Top 0.1%
4.0%

Background: Hospital incident risk scoring has long relied on two- or three-dimensional frameworks (Severity Assessment Codes or Risk Priority Numbers), even though root cause analysis standards recognize that clinical risk is multi-factorial. The obstacle has been mainly cognitive: human reviewers cannot reliably score many dimensions across high incident volumes, so richer assessment has not been operationalized at scale. Objective: To extend the traditional three-dimensional FMEA to an eight-dimensional patient-safety risk feature framework, to establish a multi-model large language model (LLM) extraction pipeline that scores these dimensions automatically, and to demonstrate a variance-aware integer optimization (mean-variance integer programming, MV-IP) that provides a reproducible tie-breaking rule for incident prioritization under extraction uncertainty, rather than improved risk coverage. Methods: An 8-dimensional framework covering harm severity, potential harm, frequency, detectability, systemic impact, vulnerable populations, regulatory relevance, and economic impact was applied to 213 synthetic and 196 real curated incident narratives. Three independent LLMs (GPT-5.4, Gemini 3.1 Pro, Grok-4.1 Fast) from different provider families extracted structured risk scores. Inter-model consistency was assessed via ICC(A,1). Among coverage-equivalent selections, MV-IP minimized inter-model variance to give a reproducible prioritization rule. An English-language sensitivity analysis was conducted on 31 AHRQ PSNet WebM&M cases. Results: On real cases, seven of eight dimensions reached Fair or better inter-model reliability (ICC(A,1) 0.53 to 0.83); D5 (Systemic Impact) was the exception at Poor reliability (0.275), driven by little between-case variation rather than by wide model disagreement. 
Reliability was not uniform: two dimensions were Excellent (D1 actual harm 0.834, D8 economic impact 0.782), two Good, and three only Fair, so some dimensions are more readily extractable than others. The same anchors gave broadly similar results on English-language narratives. When deterministic top-K selection returned several equal-coverage solutions (11 on real cases, total inter-model variance 0.205 to 1.274), MV-IP selected the minimum-disagreement set, replacing ad hoc tie-breaking with an explicit rule without improving coverage. Bootstrap resampling found 74% to 90% of per-case variance estimates stable despite the three-model panel. Conclusions: The eight-dimensional framework operationalizes patient-safety risk features that quality teams have considered only implicitly, and three independent LLM families produced reproducible scores on most dimensions of curated narratives. Inter-model agreement, however, measures reproducibility rather than clinical correctness, and high agreement does not by itself establish that a score is right; the dimensions that are reliably extractable today (notably D6 and D8) differ from those that are not yet (D5, and to a lesser degree D4 and D7), which has direct implications for incident-reporting form design. MV-IP contributes a reproducible, variance-aware tie-breaking rule rather than improved coverage. Validation against expert-prioritized RCA lists and deployment on raw institutional incident reports remain the next steps toward clinical use.
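
The tie-breaking idea described in this abstract, choosing the minimum-disagreement set among equal-coverage top-K selections, can be illustrated with a toy example. The sketch below substitutes brute-force enumeration for the integer program the authors describe, and it assumes a simple integer "coverage" value per case and population variance across model scores as the disagreement measure. These definitions, the function name, and the data are all hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import pvariance


def select_top_k(cases, k):
    """Pick k cases with maximal total coverage, breaking ties by minimal
    total inter-model variance.

    cases maps case_id -> (coverage, [score from each model]).
    Brute force stands in for the integer program; fine for toy sizes only.
    """
    def coverage(subset):
        return sum(cases[c][0] for c in subset)

    def disagreement(subset):
        return sum(pvariance(cases[c][1]) for c in subset)

    subsets = list(combinations(sorted(cases), k))
    best_coverage = max(coverage(s) for s in subsets)
    tied = [s for s in subsets if coverage(s) == best_coverage]
    # Among coverage-equivalent selections, prefer the least model disagreement;
    # the sorted case ids give a deterministic final tie-break.
    return min(tied, key=lambda s: (disagreement(s), s))


# Invented data: B and C have equal coverage, but the models disagree on C.
toy_cases = {
    "A": (5, [5, 5, 5]),
    "B": (4, [4, 4, 4]),
    "C": (4, [2, 4, 6]),
    "D": (1, [1, 1, 1]),
}

selected = select_top_k(toy_cases, k=2)
print(selected)
```

As in the abstract, the rule does not raise coverage: both tied selections cover the same total, and the variance term only decides which of them is returned.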

13
Performance evaluation and benchmarking across 16 large language models on a comprehensive real-world emergency department triage data set

Benning, L.; Hirsch, A.; Groeschel, M.; Roeschl, T.; Spott, M.; Hans, F. P.; Urban, T.; Busch, H.-J.; Meyer, A.; Madrid, J.

2026-06-05 health informatics 10.64898/2026.05.28.26353935 medRxiv
Top 0.1%
3.8%

Background: Emergency department (ED) triage is a high-stakes clinical decision process that determines patient prioritization and resource allocation under time pressure. Large language models (LLMs) have recently been proposed as decision-support tools for triage, yet most evaluations rely on simulated scenarios or curated datasets. Evidence from real-world clinical environments remains limited. The objective of this project was to systematically evaluate the performance, calibration, and reproducibility of multiple contemporary large language models for Emergency Severity Index (ESI) classification and sectoral allocation (ED vs. urgent care practice, UCP) using a comprehensive real-world triage dataset. Material and Methods: Retrospective cross-sectional benchmarking study conducted at a tertiary academic ED in Germany with an integrated central point of assessment (CPA). The study included all consecutive adult walk-in encounters (>18 years) presenting between October 2023 and February 2024 (N = 16,107). Data were collected from a structured clinical decision support system capturing presenting complaints, vital signs, and triage decisions recorded by specialized nursing staff. These data comprised structured clinical variables routinely collected at triage, including presenting complaint categories (CEDIS-PCL), vital signs according to the ABCDE framework, and additional structured or free-text clinical information. Results: The primary outcome was the agreement between LLM-predicted and nurse-assigned ESI levels measured using quadratic-weighted Cohen's κ. Secondary outcomes included sectoral assignment agreement, misclassification patterns (over- and under-triage), calibration metrics, and output reproducibility. Quadratic-weighted κ values ranged from 0.18 to 0.75 across models. Only a structured stepwise prompting strategy achieved substantial agreement (κ_qw = 0.747), approaching reported human inter-rater reliability. 
Most models demonstrated moderate or lower agreement and systematic overconfidence, with expected calibration errors (ECE) based on verbalized confidence ranging from 0.099 to 0.355. Sectoral assignment agreement (i.e., ED vs. urgent care practice, UCP) was uniformly low (κ < 0.30). Reproducibility testing revealed substantial variability in 23% of cases, indicating non-deterministic output behavior for clinically relevant decisions. Conclusions: Current large language models demonstrate heterogeneous and generally limited performance in real-world emergency triage tasks. Structured algorithm-guided prompting appears more influential than model architecture or size. Before clinical implementation, improvements in calibration, reliability, and workflow integration are required, alongside regulatory-compliant validation in prospective clinical settings.
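
For readers unfamiliar with the two headline metrics in this abstract, the following is a minimal sketch of how quadratic-weighted Cohen's kappa and expected calibration error are commonly computed. It is not the authors' code: the function names, the ten equal-width confidence bins, and the assumption that verbalized confidence is expressed on a 0 to 1 scale are all illustrative choices, since the abstract does not describe the implementation.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Quadratic-weighted Cohen's kappa for two lists of ordinal labels.

    `categories` lists every possible level in order (e.g. ESI 1..5).
    """
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    n = len(rater_a)
    if n == 0 or n != len(rater_b) or k < 2:
        raise ValueError("need equal-length, non-empty ratings and >= 2 categories")

    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marginal_a = Counter(index[a] for a in rater_a)
    marginal_b = Counter(index[b] for b in rater_b)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Penalty grows with the squared distance between the two levels.
            weight = ((i - j) / (k - 1)) ** 2
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * (marginal_a[i] / n) * (marginal_b[j] / n)

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins (binning scheme is an assumption here).

    `confidences` are values in [0, 1]; `correct` are booleans.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its |accuracy - confidence| gap, weighted by size.
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece
```

Under this definition, a model that reports 90% confidence but is right only half the time has an ECE of 0.4, which is the kind of overconfidence the abstract describes.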

14
Multinational Public Opinion on Race, Ethnicity, and Algorithmic Reform in Medicine

Adibi, A.; Le, K. X.; Pierson, E.; Diao, J. A.; Esfandiari, N.; Carlsten, C.; Sadatsafavi, M.

2026-05-21 health policy 10.64898/2026.05.15.26352687 medRxiv
Top 0.1%
3.8%

Importance: Several professional medical societies have removed race and ethnicity from widely used clinical algorithms with implications for millions of patients. Yet the opinions of patients and the public regarding the tensions underlying these pivotal changes have not been systematically explored. Objective: To assess global public opinion on the use of race or ethnicity in clinical algorithms, including preferences for different approaches to algorithmic reform and perceptions of alternative predictors. Design: Cross-sectional survey study. Setting: Multinational opt-in online survey conducted via Prolific in January 2026. Participants: A volunteer convenience sample with quota sampling to achieve approximately equal participation by sex at birth and across ten categories of self-identified race and ethnicity. Main Outcomes and Measures: Self-reported comfort with demographic and social predictors in clinical calculators, with net comfort defined as percentage extremely or somewhat comfortable minus percentage extremely or somewhat uncomfortable; preferences for race-specific versus race-free algorithms; perceptions of algorithmic harm or benefit. Results: Of 1,050 responses, 994 (94.7%) met eligibility criteria. Participants resided in 43 countries with a median age of 32.0 years (IQR, 26-41). Net comfort with the use of race or ethnicity in a hypothetical cancer risk calculator was +62.4% (95% CI: +57.8% to +66.9%), compared with +14.5% (95% CI: +9.1% to +19.9%) for postal or ZIP code. Overall, 87.9% (95% CI: 85.9% to 90.0%) were comfortable with race or ethnicity if a clinician explained its use and only 12.8% agreed race and ethnicity should never be used clinically. Across spirometry, kidney function, and cardiovascular risk calculators, 40.0% to 47.6% preferred race-specific versions, whereas 16.7% to 28.2% preferred race-free alternatives. 
Furthermore, a substantial proportion disagreed that they were well-represented by race and ethnicity categories, ranging from 22.1% for osteoporotic fracture risk equations to 42.9% for cardiovascular risk equations. These findings were consistent across countries, self-identified race and ethnicity, and among participants reporting prior experiences of racism in healthcare. Conclusions and Relevance: In our diverse multinational survey study, respondents were comfortable with the use of race and ethnicity across application areas, but often did not feel represented by existing categories and were less comfortable with the use of alternatives based on postal or ZIP codes.
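
The net comfort measure defined in this abstract (percentage extremely or somewhat comfortable minus percentage extremely or somewhat uncomfortable) can be sketched as below. The response labels and the function name are hypothetical; the survey's actual coding of answers is not given in the abstract.

```python
def net_comfort(responses):
    """Net comfort in percentage points for a list of Likert responses.

    The label strings are illustrative assumptions, not the survey's wording.
    Neutral or other answers count toward the denominator only.
    """
    comfortable = {"extremely comfortable", "somewhat comfortable"}
    uncomfortable = {"extremely uncomfortable", "somewhat uncomfortable"}
    if not responses:
        raise ValueError("need at least one response")

    positive = sum(1 for r in responses if r in comfortable)
    negative = sum(1 for r in responses if r in uncomfortable)
    return 100.0 * (positive - negative) / len(responses)
```

With 7 comfortable, 2 uncomfortable, and 1 neutral answer out of 10, this returns +50.0, so a positive value means comfortable respondents outnumber uncomfortable ones.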

15
Use of large language models by academic hospitalists: results of a multicenter survey

Bressman, E.; Auerbach, A.; Keniston, A.; Jens, C.; Ranji, S.

2026-05-29 health systems and quality improvement 10.64898/2026.05.27.26353610 medRxiv
Top 0.1%
3.7%

Introduction: The use of artificial intelligence (AI) by clinicians has increased rapidly in recent years, with large language models (LLMs) emerging as tools that can equal clinician diagnostic performance in simulated settings. However, limited data exist regarding physicians' use of LLMs in real-world clinical practice. This study aimed to evaluate the frequency of LLM use among practicing hospitalists, identify which LLMs are most commonly utilized, and assess hospitalists' perceptions of the benefits and limitations of LLM use in clinical care. Methods: We conducted a cross-sectional survey study of academic hospital medicine faculty across 8 institutions within the Hospital Medicine Reengineering Network (HOMERuN), a collaborative research consortium. Eligible participants included hospitalists practicing within participating HOMERuN sites during the study period. The survey assessed the frequency of LLM use, types of LLMs used, clinical applications, and physician perceptions regarding usefulness, efficiency, and concerns associated with LLM adoption. Results: 170 respondents (67.1%) reported ever using an LLM in clinical practice. Among LLM users, OpenEvidence was the most used tool (88.9%), followed by ChatGPT (58.5%), Google Gemini (26.9%), and Microsoft Copilot (20.5%). Only a minority of hospitalists reported using LLMs daily while seeing patients. The most common use cases of LLMs were answering diagnostic (77.1%) and management (77.6%) questions. A majority also reported using LLMs to identify or summarize primary literature (60.0%). Lack of trust in outputs (49.8%), uncertainty around institutional policies (48.6%), and lack of access to secure applications (43.1%) were cited as the most frequent barriers to using LLMs in practice. Discussion: The use of LLMs in clinical practice is already widespread, though regular or daily use is not yet typical. 
Concerns regarding reliability, patient privacy, and safe integration into clinical workflows remain significant barriers to broader adoption. The responsible implementation of LLMs in hospital medicine will require addressing these barriers.

16
Large Language Model Performance in UK Advice & Guidance: A Pilot Study in Neurology

Healy, J.; Marvasti, A.; Wallace, D.; Baheerathan, A.; Ghosh, A.; Kossoff, J.; Thio, S.; Balaratnam, M.; Haider, S.; Ellershaw, S.; Dobson, R.

2026-05-18 neurology 10.64898/2026.05.13.26353081 medRxiv
Top 0.1%
3.7%

Background: Large language models (LLMs) demonstrate strong performance in controlled medical environments such as multiple choice exams, but their utility in real-world clinical workflows remains unproven. The NHS Advice & Guidance (A&G) service, where Primary Care clinicians can submit text-based queries to specialists, provides an environment for evaluating the clinical performance of LLMs as a specialist. Methods: We compared responses from MedGemma 4B-IT, an open-weight model deployed locally on hospital infrastructure, against specialist neurologist responses across 50 adult neurology A&G cases from University College London Hospital. Two neurologists and two GPs rated 80 blinded and 20 unblinded responses for outcome, safety, efficacy, and feasibility using standardised criteria; outcome was a binary correct/incorrect, while other domains were scored 1-5. Inter-rater reliability was assessed using intraclass correlation coefficients. Results: Although there were no statistically significant differences between blinded specialist neurologists and LLM responses across any domain (outcome: 84% vs 82%, p=0.67; safety: 3.98 vs 4.02, p=0.85; efficacy: 4.06 vs 3.98, p=0.61; feasibility: 4.39 vs 4.20, p=0.45), 10% of LLM responses received concerning scores (≤2 average score) compared to 0% of human responses, indicating potentially clinically important tail risk. Furthermore, unblinded results showed a preference for human responses, with human ratings being preferred across all domains. Only 51% of binary outcomes had unanimous agreement and inter-rater agreement was moderate across other domains (ICC 0.50-0.52). Conclusions: In this pilot study, aggregate scores between blinded human and LLM responses were similar, and no statistically significant differences were detected in this exploratory sample. However, aggregate metrics masked clinically important edge-case failures in LLM responses. 
Pronounced inter-rater variability and the potential impact of LLM/human syntax on blinded rater judgements highlight the challenges in establishing robust evaluation frameworks for clinical LLM deployment.

17
Clinical Note Comparison and Data Retrieval Via Embedding Vectors: Model Selection, Metrics, and Convergence

Dahlberg, A. C. H.; Tapiola, O.; Luisto, R.; Puranen, T.; Sanmark, E.; Vartiainen, V.

2026-05-18 health informatics 10.64898/2026.05.12.26352832 medRxiv
Top 0.2%
3.6%

Background: Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records. Methods: Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, and L1 and Euclidean distances were computed between the vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. Two models with the biggest contrast in semantic difference detection were evaluated on retrieval of relevant information from real Finnish vascular surgery records. Results: Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6-1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding. 
Conclusions: Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.
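
The three vector comparisons named in the methods (cosine similarity, L1 distance, and Euclidean distance) can be illustrated with the short sketch below. It uses plain Python lists in place of real embedding vectors, and the function names are illustrative; the abstract does not say which library or normalization the authors used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

In a perturbation experiment of the kind described, a larger semantic change to the text would be expected to lower the cosine similarity and raise both distances between the original and perturbed embeddings.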

18
When Algorithms Prescribe: A Cross-Sectional Study of Quality, Misinformation, and Engagement in Statin-Related Content on TikTok

Gharibyan, I.; Ahner, E.; Shao, R.; Sharma, D.; Navarsartian Tazehkand, T.; Diep, J.; Assoumou, B.

2026-06-08 health informatics 10.64898/2026.06.04.26354962 medRxiv
Top 0.2%
3.5%

Background: Statins are key to preventing atherosclerotic cardiovascular disease and lowering low-density lipoprotein cholesterol and cardiovascular events. However, skepticism regarding their safety and value persists and is increasingly influenced by social media. TikTok has emerged as a major source of health information, but its content varies in quality and accuracy. This study evaluated the quality, attitudes, misinformation, and engagement of statin-related content on TikTok. Methods: Public TikTok videos were collected using predefined search terms and coded by creator type, thematic content, and overall attitude. Video quality was assessed using the DISCERN instrument, the Patient Education Materials Assessment Tool for Audiovisual Materials, and the Global Quality Score. False or misleading claims were independently reviewed by two cardiology fellows. Associations between engagement and quality were also examined. Results: Of 1,349 screened videos, 258 met inclusion criteria. Most were educational (91.0%), with non-physician healthcare providers (34.5%) as the largest creator group. Risks or negative effects were discussed more often than benefits (63.2% vs 42.2%), and 39.5% contained at least one false or misleading claim, most often from complementary and alternative medicine providers and wellness promoters. Quality differed by creator type across all instruments, with physician-created content scoring highest. Video popularity showed minimal association with informational quality. Conclusion: Statin-related TikTok content frequently emphasizes harms, often contains misinformation, and varies substantially in quality by creator type. Greater involvement of healthcare professionals on social media may help improve digital health literacy and counter misleading information about statin therapy.

19
Combining centralized and decentralized approaches to assess and ensure data quality in Eurocrine® via Microsoft Power BI and DataquieR

Musholt, T. J.; Clerici, T.; Bergenfelz, A.; Schmidt, C. O.; Struckmann, S.

2026-06-05 health informatics 10.64898/2026.06.04.26354884 medRxiv
Top 0.2%
3.5%

Background: Medical registries have gained importance in the evaluation of healthcare quality outcomes. In the absence of high-quality evidence, such as randomized controlled trials, studies based on registry data are essential for informing clinical guidelines. Methods for assessing data quality are rarely described in detail. To ensure the credibility of registry-based studies, registries must use all available technical and operational means to guarantee high data quality. Method: Eurocrine® is a pan-European endocrine surgical database and quality registry initially funded by the EU healthcare programme, which started in 2015 and included more than 200,000 interventions as of April 2025. To ensure high data quality, interactive and standardized reports are created both centrally and locally via Microsoft Power BI. In addition, comprehensive data quality analyses were performed via the R-based package dataquieR. Results: Although a multitude of technical measures (for example, input screen design and real-time plausibility checks during data entry) are in place, they are not sufficient to prevent human errors at data entry. Errors identified in the reports were corrected, and preventive measures were implemented. Overall, the data quality was assessed as very good in terms of completeness, accuracy, and consistency. Conclusion: It is very important to provide registry users with an efficient and smart tool to identify data issues, as they have the clinical information to correct them. Data quality reports generated with dataquieR represent an effective tool for registry administrators. Predesigned Microsoft Power BI reports enable participating Eurocrine® clinics to self-audit their data.

20
Cancer care disruption during the COVID-19 pandemic in Ontario, Canada: A sequential mixed-methods study

Timilshina, N.; Jacobson, D.; Birze, A.; Wodchis, W. P.; Kuluski, K.; Strumpf, E.; Ammi, M.

2026-06-12 health systems and quality improvement 10.64898/2026.06.10.26355360 medRxiv
Top 0.2%
3.1%

Introduction: The COVID-19 pandemic profoundly disrupted healthcare delivery worldwide, with cancer care among the most affected services. Prior studies documented delays in referrals, reduced specialist access, and increased provider burden. However, the extent to which these experiences were reflected at the system level remains unclear. Objective: To document cancer care experiences and examine whether these experiences were reflected in population-level health system indicators across Ontario, Canada. Methods: We used an exploratory sequential mixed-methods design. Qualitative data were collected through focus groups and semi-structured interviews with 32 participants, including patients with cancer (n=8), caregivers (n=5), healthcare providers (n=14), and decision-makers (n=5) across two hospital settings in Ontario, Canada. Emergent themes informed the development of quantitative indicators. We then conducted a retrospective population-based analysis of linked administrative health databases for cancer patients in Ontario (n=87,786) to assess the prevalence of identified themes. Results: Four themes emerged: (I) delays in diagnosis and screening; (II) disrupted access to primary care; (III) barriers to specialist and mental health services; and (IV) fragmented care for patients with multimorbidity. Quantitative findings corroborated major themes. Screening rates declined for cervical (64.8% to 57.5%) and breast cancer (64.5% to 57.2%). While in-person primary care shifted almost entirely to virtual modalities (8.5% to 95.4%), overall visit volumes remained stable. Specialist care showed uneven patterns, with increased oncology visits but declines in cardiology and mental health services. Patients with multiple comorbidities experienced the largest reductions in non-oncology specialist care. Conclusion: The pandemic disrupted key components of cancer care, particularly screening, access to certain specialist services, and care for patients with complex needs. 
Integrating qualitative and quantitative evidence highlights areas of system vulnerability and underscores the need for coordinated, resilient cancer care capable of maintaining essential services during future crises.